
    A Genealogical Interpretation of Principal Components Analysis

    Principal components analysis (PCA) is a statistical method commonly used in population genetics to identify structure in the distribution of genetic variation across geographical location and ethnic background. However, while the method is often used to inform about historical demographic processes, little is known about the relationship between fundamental demographic parameters and the projection of samples onto the primary axes. Here I show that for SNP data the projection of samples onto the principal components can be obtained directly from considering the average coalescent times between pairs of haploid genomes. The result provides a framework for interpreting PCA projections in terms of underlying processes, including migration, geographical isolation, and admixture. I also demonstrate a link between PCA and Wright's FST and show that SNP ascertainment has a largely simple and predictable effect on the projection of samples. Using examples from human genetics, I discuss the application of these results to empirical data and the implications for inference.

    How Many Subpopulations is Too Many? Exponential Lower Bounds for Inferring Population Histories

    Reconstruction of population histories is a central problem in population genetics. Existing coalescent-based methods, like the seminal work of Li and Durbin (Nature, 2011), attempt to solve this problem using sequence data but have no rigorous guarantees. Determining the amount of data needed to correctly reconstruct population histories is a major challenge. Using a variety of tools from information theory, the theory of extremal polynomials, and approximation theory, we prove new sharp information-theoretic lower bounds on the problem of reconstructing population structure -- the history of multiple subpopulations that merge, split and change sizes over time. Our lower bounds are exponential in the number of subpopulations, even when reconstructing recent histories. We demonstrate the sharpness of our lower bounds by providing algorithms for distinguishing and learning population histories with matching dependence on the number of subpopulations. Along the way and of independent interest, we essentially determine the optimal number of samples needed to learn an exponential mixture distribution information-theoretically, proving the upper bound by analyzing natural (and efficient) algorithms for this problem. Comment: 38 pages, appeared in RECOMB 201

    Genomic signatures of population decline in the malaria mosquito Anopheles gambiae

    Population genomic features such as nucleotide diversity and linkage disequilibrium are expected to be strongly shaped by changes in population size, and might therefore be useful for monitoring the success of a control campaign. In the Kilifi district of Kenya, there has been a marked decline in the abundance of the malaria vector Anopheles gambiae subsequent to the rollout of insecticide-treated bed nets. To investigate whether this decline left a detectable population genomic signature, simulations were performed to compare the effect of population crashes on nucleotide diversity, Tajima's D, and linkage disequilibrium (as measured by the population recombination parameter ρ). Linkage disequilibrium and ρ were estimated for An. gambiae from Kilifi, and compared to values for Anopheles arabiensis and Anopheles merus at the same location, and for An. gambiae in a location 200 km from Kilifi. In the first simulations ρ changed more rapidly after a population crash than the other statistics, and therefore is a more sensitive indicator of recent population decline. In the empirical data, linkage disequilibrium extends 100-1000 times further, and ρ is 100-1000 times smaller, for the Kilifi population of An. gambiae than for any of the other populations. There were also significant runs of homozygosity in many of the individual An. gambiae mosquitoes from Kilifi. These results support the hypothesis that the recent decline in An. gambiae was driven by the rollout of bed nets. Measuring population genomic parameters in a small sample of individuals before, during and after vector or pest control may be a valuable method of tracking the effectiveness of interventions.

    Recombination rate and selection strength in HIV intra-patient evolution

    The evolutionary dynamics of HIV during the chronic phase of infection is driven by the host immune response and by selective pressures exerted through drug treatment. To understand and model the evolution of HIV quantitatively, the parameters governing genetic diversification and the strength of selection need to be known. While mutation rates can be measured in single replication cycles, the relevant effective recombination rate depends on the probability of coinfection of a cell with more than one virus and can only be inferred from population data. However, most population genetic estimators for recombination rates assume absence of selection and are hence of limited applicability to HIV, since positive and purifying selection are important in HIV evolution. Here, we estimate the rate of recombination and the distribution of selection coefficients from time-resolved sequence data tracking the evolution of HIV within single patients. By examining temporal changes in the genetic composition of the population, we estimate the effective recombination to be r=1.4e-5 recombinations per site and generation. Furthermore, we provide evidence that selection coefficients of at least 15% of the observed non-synonymous polymorphisms exceed 0.8% per generation. These results provide a basis for a more detailed understanding of the evolution of HIV. A particularly interesting case is evolution in response to drug treatment, where recombination can facilitate the rapid acquisition of multiple resistance mutations. With the methods developed here, more precise and more detailed studies will be possible, as soon as data with higher time resolution and greater sample sizes is available. Comment: to appear in PLoS Computational Biology

    Sexual selection protects against extinction

    Reproduction through sex carries substantial costs, mainly because only half of sexual adults produce offspring. It has been theorised that these costs could be countered if sex allows sexual selection to clear the universal fitness constraint of mutation load. Under sexual selection, competition between (usually) males, and mate choice by (usually) females create important intraspecific filters for reproductive success, so that only a subset of males gains paternity. If reproductive success under sexual selection is dependent on individual condition, which depends on mutation load, then sexually selected filtering through ‘genic capture’ could offset the costs of sex because it provides genetic benefits to populations. Here, we test this theory experimentally by comparing whether populations with histories of strong versus weak sexual selection purge mutation load and resist extinction differently. After evolving replicate populations of the flour beetle Tribolium castaneum for ~7 years under conditions that differed solely in the strengths of sexual selection, we revealed mutation load using inbreeding. Lineages from populations that had previously experienced strong sexual selection were resilient to extinction and maintained fitness under inbreeding, with some families continuing to survive after 20 generations of sib × sib mating. By contrast, lineages derived from populations that experienced weak or non-existent sexual selection showed rapid fitness declines under inbreeding, and all were extinct after generation 10. Multiple mutations across the genome with individually small effects can be difficult to clear, yet sum to a significant fitness load; our findings reveal that sexual selection reduces this load, improving population viability in the face of genetic stress.

    Forward-time simulation of realistic samples for genome-wide association studies

    Background: Forward-time simulations have unique advantages in power and flexibility for the simulation of genetic samples of complex human diseases because they can closely mimic the evolution of human populations carrying these diseases. However, a number of methodological and computational constraints have prevented the power of this simulation method from being fully explored in existing forward-time simulation methods. Results: Using a general-purpose forward-time population genetics simulation environment, we developed a forward-time simulation method that can be used to simulate realistic samples for genome-wide association studies. We examined the properties of this simulation method by comparing simulated samples with real data and demonstrated its wide applicability using four examples, including a simulation of case-control samples with a disease caused by multiple interacting genetic and environmental factors, a simulation of trio families affected by a disease-predisposing allele that had been subjected to either slow or rapid selective sweep, and a simulation of a structured population resulting from recent population admixture. Conclusions: Our algorithm simulates populations that closely resemble the complex structure of the human genome, while allowing the introduction of signals of natural selection. Because of its flexibility to generate different types of samples with arbitrary disease or quantitative trait models, this simulation method can simulate realistic samples to evaluate the performance of a wide variety of statistical gene mapping methods for genome-wide association studies.

    Pervasive Hitchhiking at Coding and Regulatory Sites in Humans

    Much effort and interest have focused on assessing the importance of natural selection, particularly positive natural selection, in shaping the human genome. Although scans for positive selection have identified candidate loci that may be associated with positive selection in humans, such scans do not indicate whether adaptation is frequent in general in humans. Studies based on the reasoning of the McDonald–Kreitman test, which, in principle, can be used to evaluate the extent of positive selection, suggested that adaptation is detectable in the human genome but that it is less common than in Drosophila or Escherichia coli. Both positive and purifying natural selection at functional sites should affect levels and patterns of polymorphism at linked nonfunctional sites. Here, we search for these effects by analyzing patterns of neutral polymorphism in humans in relation to the rates of recombination, functional density, and functional divergence with chimpanzees. We find that the levels of neutral polymorphism are lower in the regions of lower recombination and in the regions of higher functional density or divergence. These correlations persist after controlling for the variation in GC content, density of simple repeats, selective constraint, mutation rate, and depth of sequencing coverage. We argue that these results are most plausibly explained by the effects of natural selection at functional sites (either recurrent selective sweeps or background selection) on the levels of linked neutral polymorphism. Natural selection at both coding and regulatory sites appears to affect linked neutral polymorphism, reducing neutral polymorphism by 6% genome-wide and by 11% in the gene-rich half of the human genome. These findings suggest that the effects of natural selection at linked sites cannot be ignored in the study of neutral human polymorphism.

    An Approximate Bayesian Estimator Suggests Strong, Recurrent Selective Sweeps in Drosophila

    The recurrent fixation of newly arising, beneficial mutations in a species reduces levels of linked neutral variability. Models positing frequent weakly beneficial substitutions or, alternatively, rare, strongly selected substitutions predict similar average effects on linked neutral variability, if the product of the rate and strength of selection is held constant. We propose an approximate Bayesian (ABC) polymorphism-based estimator that can be used to distinguish between these models, and apply it to multi-locus data from Drosophila melanogaster. We investigate the extent to which inference about the strength of selection is sensitive to assumptions about the underlying distributions of the rates of substitution and recombination, the strength of selection, heterogeneity in mutation rate, as well as the population's demographic history. We show that assuming fixed values of selection parameters in estimation leads to overestimates of the strength of selection and underestimates of the rate. We estimate parameters for an African population of D. melanogaster (ŝ ∼ 2E−03) and compare these to previous estimates. Finally, we show that surveying larger genomic regions is expected to lend much more discriminatory power to the approach. It will thus be of great interest to apply this method to emerging whole-genome polymorphism data sets in many taxa.

    Discovery of Rare Variants via Sequencing: Implications for the Design of Complex Trait Association Studies

    There is strong evidence that rare variants are involved in complex disease etiology. The first step in implicating rare variants in disease etiology is their identification through sequencing in both randomly ascertained samples (e.g., the 1,000 Genomes Project) and samples ascertained according to disease status. We investigated to what extent rare variants will be observed across the genome and in candidate genes in randomly ascertained samples, the magnitude of variant enrichment in diseased individuals, and biases that can occur due to how variants are discovered. Although sequencing cases can enrich for causal variants, when a gene or genes are not involved in disease etiology, limiting variant discovery to cases can lead to association studies with dramatically inflated false positive rates.

    Genetic Crossovers Are Predicted Accurately by the Computed Human Recombination Map

    Hotspots of meiotic recombination can change rapidly over time. This instability and the reported high level of inter-individual variation in meiotic recombination put in question the accuracy of the calculated hotspot map, which is based on the summation of past genetic crossovers. To estimate the accuracy of the computed recombination rate map, we have mapped genetic crossovers to a median resolution of 70 Kb in 10 CEPH pedigrees. We then compared the positions of crossovers with the hotspots computed from HapMap data and performed extensive computer simulations to compare the observed distributions of crossovers with the distributions expected from the calculated recombination rate maps. Here we show that a population-averaged hotspot map computed from linkage disequilibrium data predicts present-day genetic crossovers well. We find that computed hotspot maps accurately estimate both the strength and the position of meiotic hotspots. An in-depth examination of crossovers that were not predicted shows that they are preferentially located in regions where hotspots are found in other populations. In summary, we find that by combining several computed population-specific maps we can capture the variation in individual hotspots to generate a hotspot map that can predict almost all present-day genetic crossovers.